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Abstract 

This paper studies the effect of discretizing the parametrization of a dictionary used for 
Matching Pursuit decompositions of signals. Our approach relies on viewing the continuously 
parametrized dictionary as an embedded manifold in the signal space on which the tools of 
differential (Ricmannian) geometry can be applied. The main contribution of this paper is 
twofold. First, we prove that if a discrete dictionary reaches a minimal density criterion, then 
the corresponding discrete MP (dMP) is equivalent in terms of convergence to a weakened 
hypothetical continuous MP. Interestingly, the corresponding weakness factor depends on a 
density measure of the discrete dictionary. Second, we show that the insertion of a simple 
geometric gradient ascent optimization on the atom dMP selection maintains the previous 
comparison but with a weakness factor at least two times closer to unity than without opti- 
mization. Finally, we present numerical experiments confirming our theoretical predictions for 
decomposition of signals and images on regular discretizations of dictionary parametrizations. 



Keywords: Matching Pursuit, Ricmannian geometry, Optimization, Convergence, Dictionary, 
Parametrization. 

1 Introduction 

There has been a large effort in the last decade to develop analysis techniques that decompose 
non-stationary signals into elementary components, called atoms, that characterize their salient 
features [1-5]. In particular, the matching pursuit (MP) algorithm has been extensively studied [2, 
6-11] to expand a signal over a redundant dictionary of elementary atoms, based on a greedy 
process that selects the elementary function that best matches the residual signal at each iteration. 
Hence, MP progressively isolates the structures of the signal that are coherent with respect to the 
chosen dictionary, and provides an adaptive signal representation in which the more significant 
coefficients are first extracted. The progressive nature of MP is a key issue for adaptive and 
scalable communication applications [12,13]. 

A majority of works that have considered MP for practical signal approximation and com- 
pression define the dictionary based on the discretization of a parametrized prototype function, 
typically a scaled/modulated Gaussian function or its second derivative [6,14,15]. An orthogonal 
1-D or 2-D wavelet basis is also a trivial example of such a discretization even if in that case 
MP is not required to find signal coefficients; a simple wavelet decomposition is computation- 
ally more efficient. Works that do not directly rely on a prototype function either approximate 
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such a parametrized dictionary based on computationally efficient cascades of filters [16-18], or 
attempt to adapt a set of parametrized dictionary elements to a set of training signal samples 
based on vector quantization techniques [19,20]. Thus, most earlier works define their dictionary 
by discretizing, directly or indirectly, the parameters of a prototype function. 

The key question is then: how should the continuous parameter space be discretized ? A fine 
discretization results in a large dictionary which approximates signals efficiently with few atoms, 
but costs both in terms of computational complexity and atom index entropy coding. Previous 
works have studied this trade-off empirically [6,15]. In contrast, our paper focuses on this question 
in a formal way. It provides a first attempt to quantify analytically how the MP convergence is 
affected by the discretization of the continuous space of dictionary function parameters. 

Our compass to reach this objective is the natural geometry of the continuous dictionary. This 
dictionary can be seen as a parametric (Riemannian) manifold on which the tools of differential 
geometry can be applied. This geometrical approach, of increasing interest in the signal processing 
literature, is inspired by the works [21,22] on Image Appearance Manifolds, and is also closely 
linked to manifolds of parametric probability density function associated to the Fisher information 
metric [23]. Some preliminary hints were also provided in a Riemannian study of generalized 
correlation of signals with probing functions [24]. 

The outcome of our study is twofold. On the one hand, we analyze how the rate of convergence 
of the continuous MP (cMP) is affected by the discretization of the prototype function parameters. 
We demonstrate that the MP using that discretized dictionary (dMP) converges like a weak 
continuous MP, i.e. a MP algorithm where the coefficient of the selected atom at each iteration 
overtakes only a percentage (the weakness factor) of the largest atom magnitude. We describe 
then how this weakness factor decreases as the so-called density radiuso of the discretization 
increases. This observation is demonstrated experimentally on images and randomly generated 
1-D signals. 

On the other hand, to improve the rate of convergence of discrete MP without resorting to a 
finer but computationally heavier discretization, we propose to exploit a geometric gradient ascent 
method. This allows to converge to a set of locally optimal continuous parameters, starting from 
the best set of parameters identified by a coarse but computationally light discrete MP. Each 
atom of the MP expansion is then defined in two steps. The first step selects the discrete set 
of parameters that maximizes the inner product between the corresponding dictionary function 
and the residual signal. The second step implements a (manifolcH) gradient ascent method to 
compute the prototype function parameters that maximize the inner product function over the 
continuous parameter space. As a main analytical result, we demonstrate that this geometrically 
optimized discrete MP (gMP) is again equivalent to a continuous MP, but with a weakness factor 
that is two times closer to unity than for the non-optimized dMP. Our experiments confirm that 
the proposed gradient ascent procedure significantly increases the rate of convergence of MP, 
compared to the non-optimized discrete MP. At an equivalent convergence rate, the optimization 
allows reduction of the discretization density by an order of magnitude, resulting in significant 
computational gains. 

The paper is organized as follows. In Section 2, we introduce the notions of parametric 

This density radius represents the maximal distance between any atom of the continuous dictionary and its 
closest atom in the discretization. 

2 In the sense that this gradient ascent evolves on the manifold induced by the intrinsic dictionary geometry. 
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dictionary in the context of signal decomposition in an abstract Hilbert space. This dictionary is 
then envisioned as a Hilbert manifold, and we describe how its geometrical structure influences 
its parametrization using the tools of differential geometry. Section 3 surveys the definition of 
(weak) continuous MP providing a theoretical optimal rate of convergence for further comparisons 
with other greedy decompositions. A "discretization autopsy" of this algorithm is performed in 
Section 4 and a resulting theorem explaining the dependences of the dMP convergence relatively to 
this sampling is proved. A simple but illustrative example of a 1-D dictionary, the wavelet (affine) 
dictionary, is then given. The optimization scheme announced above is developed in Section 5. 
After a review of gradient ascent optimization evolving on manifolds, the geometrically optimized 
MP is introduced and its theoretical rate of convergence analyzed in a second theorem. Finally, in 
Section 6, experiments are performed for 1-D and 2-D signal decompositions using dMP and gMP 
on various regular discretizations of dictionary parametrizations. We provide links to previous 
related works in Section 7 and conclude with possible extensions in Section 8. 

2 Dictionary, Parametrization and Differential Geometry 

Our object of interest throughout this paper is a general real "signal", i.e. a real function / 
taking value on a measure space X. More precisely, we assume / in the set of finite energy 
signals, i.e. / G L 2 (X, d/i) = {u : X — ► M : ||u|| 2 = f x \u(x)\ 2 d/i(x) < oo}, for a certain 
integral measure dfi(x). Of course, the natural comparison of two functions u and v in L 2 (X, d/i) 
is realized through the scalar product (u, v) L 2i x \ = ( n ; v ) — fx u ( x ) v(x)dfj,(x) making L 2 (X, d/i) 
a HilbertH space where ||u|| 2 = (u,u). 

This very general framework can be specialized to 1-D signal or image decomposition where 
X is given respectively by M. or R 2 , but also to more special spaces like the two dimensional 
sphere S 2 [25] or the hyperboloid [26]. In the sequel, we will write simply L 2 {X) = L 2 (X, d/i). 

In the following sections, we will decompose f over a highly redundant parametric dictionary of 
real atoms. These are obtained from smooth transformations of a real mother function g S L 2 (X) 
of unit norm. Formally, each atom is a function g\(x) = [U(X)g](x) £ L 2 (X), for a certain 
isometric operator U parametrized by elements A € A and such that \\g\\\ = \\g\\ = 1. The 
parametrization set A is a continuous space where each A G A corresponds to P continuous 
components A = {A l }o<i<p-i of different nature. For instance, in the case of 1-D signal or 
image analysis, g may be transformed by translation, modulation, rotation, or (anisotropic) 
dilation operations, each associated to one component A* of A. Our dictionary is then the set 
dict(<?, U, O) = {g\(x) = [U(X)g](x) : A G © } , for a certain subset C A. In the rest of the 
paper, we write dict(O) = dict(g, U, O), assuming g and U implicitly given by the context. For 
the case O = A, we write T> = diet (A). 

We assume that g is twice differentiable over X and that the functions g\(x) are twice differ- 
entiable on each of the P components of A. In the following, we write d{ for the partial derivative 
with respect to A*, i.e. of any element (e.g. g\(x), (g\,u), ...) depending on A, and 

dij = d{dj. From the smoothness of U and g, we have dij = dji on quantities built from these 
two ingredients. 

Let us now analyze the geometrical structure of A. Rather than an artificial Euclidean distance 
3 Assuming it complete, i.e. every Cauchy sequence converges in this space relatively to the norm || ■ || 2 = {•, •). 
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ds(X a , A;,) 2 = J2i(Ki ~ K) 2 between A a , A;, G A, we use a distance introduced by the dictionary 
T> itself seen as a P-dimensional parametric submanifold of L 2 (X) (or a Hilbert manifold^ [27]). 
The dictionary distance d-p is thus the distance in the embedding space L 2 (X), i.e. dj)(X a , Xb) = 
\\g\a - 9\ b \\- 

From this embedding, we can define an intrinsic distance in T>, namely the geodesic distance. 
This later has been used in a similar context in the work of Grimes and Donoho [22] and we follow 
here their approach. For our two points A a , A^, assume that we have a smooth curve 7 : [0, 1] — > A 
with 7(i) = (7°(t), • • • ,7 P_1 (i)), such that 7(0) = A a and 7(1) = A b . The length £(7) of this 
curve in V is thus given by £(7) = Jq g 7 rf\ \\ dt, assuming that g 7 ( t ) is differentiabk-H with 
respect to t. 

The geodesic distance between A a and Xb in A is the length of shortest path between these 
two points, i.e. 

dg(X a ,X b ) = inf / \\^ t g^ t) \\dt, (1) 

where 7(A a — > Xb) is any differentiable curve ^(t) linking A a to Xb for t equals to and 1 
respectively. 

We denote by 7 AqA() the optimal geodesic curve joining A a and A5 on the manifold V, i.e. such 
that £(7 AaA ) = dg(X a ,Xb), and we assume henceforth that it is always possible to define this 
curve between two points of A. Note that by construction, dg(X a ,Xb) = dg(X a ,X') + dg(X',Xb), 
for all A' on the curve 7 AaA 

In the language of differential geometry, the parameter space A is a Riemannian manifold 
M. = (A,Qij) with metric Gij(X) = {dig\,djg\). Indeed, for any differentiable curve 7 : t G 
[-5, 5} — > 7(t) G A with 5 > and 7(0) = A, we have 

H^wLoll 2 = 7W(0)MA), (2) 

with ii(t) = ^n(i), and where Einstein's summation convention is used for simplicity. 

The vector = 7*(0) is by definition a vector in the tangent space T\A of A in A. The 
meaning of relation ([2]) is that the metric Qij(X) allows the definitions of a scalar product and 
a norm in each T\A. The norm of a vector £ G T X A is therefore noted |£| 2 = |C| 2 = C£ j Gij(X), 
with the correspondence ||^ £/-y ) I *=o 1 1 = For the consistency of further Riemannian geometry 
developments, we assume that our dictionary T> is non-degenerate, i.e. that it induces a positive 
definite metric Qij . Appendix [A] provides additional details. 

We conclude this section with the arc length (or curvilinear) parametrization "s" [28] of a 
curve 7(5). It is such that |7'| 2 = 7^(5) 7 /J (s) ^ (7(5)) = 1, where u'(s) = j^u(s). From its 
definition, the curvilinear parameter s is the one which measures at each point 7(5) the length 
of the segment of curve already travelled on 7 from 7(0). Therefore, in this parametrization, 
A « = TA a A 5 (°) and X b = lx a x b ( d G( X a,Xb))- 



4 This is a special case of Image Appearance Manifold (IAM) defined for instance in [21,22]. It is also closely 
linked to manifolds of parametric probability density function associated to the Fisher information metric [23]. 
Another definition of £ exists for non differentiable curve. See for instance [22]. 

6 Namely, a summation in an expression is defined implicitly each time the same index is repeated once as a 
subscript and once as a superscript, the range of summation being always [0, P — 1], so that for instance the 
expression a l bi reads "^fSg 1 a l bi. 
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3 Matching Pursuit in Continuous Dictionary 



Let us assume that we want to decompose a function / G L 2 {X) into simpler elements (atoms) 
coming from a dictionary dict(0), given a possibly uncountable and infinite subset C A. Our 
general aim is thus to find a set of coefficients {c m } such that f{x) is equal or well approximated 
by /appfz) = Y,m c m9\ m {x) with a finite set of atoms {g\ m } C dict(0). 

Formally, for a given weakness factor a G (0,1], a General Weak(a) Matching Pursuit de- 
composition of f [2,29], written MP(0,a), in the dictionary dict(0) is performed through the 
following greedyij algorithm : 

R°f = f, A°f = 0, (initialization), 
Rm+ l f = Rmf _ (g Xm+1 ,R^f)g Xm+1 , (3a) 
A m+l f = A m f + (g Xm+1 ,R^f)g Xm+1 , (3b) 

with : (g Xm+1 , R m f) 2 > a 2 sup Aee (g x , R m f) 2 . (3c) 

The quantity R m+1 f is the residual of / at iteration m + 1. Since it is orthogonal to atom g\ m+1 , 
\\R m+1 f\\ 2 = \\R m f\\ 2 - (g Xm+1 ,R m f) 2 < ||i2 m /|| 2 , so that the energy ||i? m /|| 2 is non-increasing. 
The function A m f is the m-term approximation of / with A m f = Y^k=o id^k+i > R k f) 5A fc+1 - 

Notice that the selection rule ([3c|) concerns the square of the real scalar product (g x ,R m f). 
Matching Pursuit atom selection is typically defined over the absolute value \(g\, R m f)\. How- 
ever, we prefer this equivalent quadratic formulation first to avoid the abrupt behavior of the 
absolute value when the scalar product crosses zero, and second for consistency with the quadratic 
optimization framework to be explained in Section Finally, to allow the non-weak case where 
a = 1, we assume that a maximizer g u G dict(0) of (g,u) 2 always exists for any u £ L 2 (X). 

If is uncountable, our general Matching Pursuit algorithm is named continuous Matching 
pursuit. In particular, for = A, we write cMP(q) = MP(A, a). The rate of convergence (or 
convergence) of the cMP(a), characterized by the rate of decay of ||-R m /|| with m, can be assessed 
in certain particular cases. For instance, if there exists a Hilbert space S C L 2 (X) containing 
T> = diet (A) such that 

I3 2 = inf sup (g x ,u) 2 > 0, (4) 
ues, \\u\\=i AeA 

then the cMP(a) converges inside S. In fact, the convergence is exponential [30] since (g\ m , R m ~ l f) 2 > 
a 2 f5 2 \\R m ~ 1 f\\ 2 and ||i? m /|| 2 < WR™" 1 /]] 2 - a 2 f5 2 \\R m ~ l f\\ 2 < (1 - a 2 /3 2 ) m ||/|| 2 . We name 
(3 = P(S,T>) the greedy factor since it charaterizes the MP convergence (greediness). 

The existence of the greedy factor (3 is obvious for instance for finite dimensional space [30], 
i.e. / G C^, with finite dictionary (finite number of atoms). 

For a finite dictionary in an infinite dimensional space, as L 2 (X), the existence of /3 is not 
guaranteed over the whole space. However, there exists on the space of functions given by 
linear combination of dictionary elements, the number of terms being restricted by the dictionary 
(cumulative) coherence [29]. 

In the case of an infinite dictionary in an infinite dimension space where the greedy factor 
vanishes, cMP(a) convergence is characterized differently on the subspace of linear combination 

7 Greedy in the sense that it does not solve a global £o or l\ minimization [1] to find the coefficients c m of / app 
above, but works iteratively by solving at each iteration step a local and smaller minimization problem. 
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of countable subsets of dictionary elements. This question is addressed separately in a companion 
Technical Report [31] to this article. We now consider only the case where a non-zero greedy factor 
exists to characterize the rate of convergence of MP using continuous and discrete dictionaries. 

4 Discretization effects of Continuous Dictionary 

The greedy algorithm cMP(a) using the dictionary T> is obviously numerically unachievable 
because of the intrinsic continuity of its main ingredient, namely the parameter space A. Any 
computer implementation needs at least to discretize the parametrization of the dictionary, more 
or less densely, leading to a countable set A^ C A. This new parameter space leads naturally to 
the definition of a countable subdictionary T>^ = dict(A ( j). Henceforth, elements of Aj are labelled 
with roman letters, e.g. k, to distinguish them from the continuous greek-labelized elements of 
A, e.g A. 

For a weakness factor a £ (0,1], the discrete Weak(ad) Matching Pursuit algorithm, or 
dMP(a), of a function / £ L 2 (X) over V A is naturally defined as dMP(a) = MP(A d ,a). The 
replacement of A by Ad in the MP algorithm ([3]) leads obviously to the following question that 
we address in the next section. 

Question 1. How does the MP rate of convergence evolve when the parametrization of a dictio- 
nary is discretized and what are the quantities that control (or bound) this evolution ? 

4.1 Discretization Autopsy 

By working with T>^ instead of T>, the atoms selected at each iteration of dMP(a) are of course 
less optimal than those available in the continuous framework. Answering Question Q] requires 
a quantitative measure of the induced loss in the MP coefficients. More concretely, defining 
the score function S U (X) = {g\,u) 2 for some u £ L 2 (X), we must analyze the difference be- 
tween a maximum of S u computed over A and that obtained from A^. This function u will be 
next identified with the residue of dMP(a) at any iteration to characterize the global change in 
convergence. 

We propose to found our analysis on the geometric tools described in Section [2j 

Definition 1. The value S u (X a ) is critical in the direction of X b if, given the geodesic -y = 7 AaA() 
in the manifold M. = (A,Qij), ■^S u ('y(s))\ s= o = 0, where 7(0) = A a . 

Notice that if S u (X a ) is critical in the direction of A^, 7 n (0) diS u (X a ) = 0. An umbilical 
point for which diS u (X a ) = for all i, is obviously critical in any direction. An umbilical point 
corresponds geometrically either to maxima, minima or saddlepoints of S u relatively to A. 

Proposition 1. Given u £ L 2 (X), if S u (X a ) is critical in the direction of X b for X a , Xb G A, then 
for some r £ (0,dg(X a , X b )), 

\S u (X a ) - S u (X b )\ < \\ u \\ 2 d g (X a ,X b ) 2 {l + \\^\ s=r \\), (5) 
where j(s) = 7 A x (s) is the geodesic in A4 linking X a to X b . 
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Proof. Let us define the twice differentiable function ip(s) = S u ('y(s)) on s £ [0,?]], with rj = 
dg(X a ,Xb). A second order Taylor development of ip gives, for a certain r G (0, s), tp(s) = 
ij)(0) + sip'(0) + is 2 V"(r). Since ^'(0) = 7^(0) diS u (X a ) = by hypothesis, we have in s = r), 
|V»(0) - <%)[ = |S U (A„) - 5„(A 6 )| < \rf W'{r)\. However, on any s, W(s)\ = 2\(±g, /{sh u) 2 + 
(9 7 (s), u ) (-hz9-y(s), u )\ < 2 (II^5 7 ( S )I| 2 + Ila^3 7 ( s )ll) \W\\ 2 , using the Cauchy-Schwarz (CS) inequal- 
ity in L 2 (X) in the last equation. The result follows from the fact that || g^fi^s) || = 1- D 

The previous Lemma is particularly important since it bounds the loss in coefficient value 
when we decide to choose S u (Xb) instead of the optimal S u (X a ) in function of the geodesic distance 
dg(X a , Xb) between the two parameters. To obtain a more satisfactory control of this difference, 
we need however a new property of the dictionary. 

We start by defining the principal curvature in the point A G A as 



d 2 

ICI=i 



K\ = sup |ld^S 7c (s)L H, (6) 



where 7^ is the unique geodesic in M starting from A = 7^(0) and with 7^(0) = £, for a direction 
£ of unit norm in T\A. 

Definition 2. The condition number of a dictionary T> is the number KL~ l obtained from 

K. = sup/C A . (7) 

AeA 

If K, does not exist (not bounded fC\), by extension, T> is said to be of zero condition number. 

The notion of condition number has been introduced by Niyogi et al. [32] to bound the local 
curvature of an embedded manifolclfl in its ambient space, and to characterize its self- avoidance. 
Essentially, it is the inverse of the maximum radius of a sphere that, when placed tangent to the 
manifold at any point, intersects the manifold only at that point [33,34]. Our quantity /C _1 is 
then by construction a similar notion for the dictionary T> seen as a manifold in L 2 (X). However, 
it does not actually prevent manifold self-crossing on large distance due to the locality of our 
differential analysis 

Proposition 2. For a dictionary T> = diet (A), 

1 

1 < K < sup \(d tJ gx,d kl gx)G lk g jl Y , (8) 
AeA L J 

where = Q Z ^{X) is the invers^^ ofQij. 

8 In their work, the condition number, named there t _1 , of a manifold M' measures the maximal "thickness" 
r of the normal bundle, the union of all the orthogonal complement of every tangent plane at every point of the 
manifold. 

9 A careful study of local self-avoidance of well-conditioned dictionary would have to be considered but this is 
beyond the scope of this paper. 

10 Using Einstein convention, this means Q lk Gkj = GjkG ' = S), for the Kronecker's symbol 8* = 8ij — S 1 -* = 1 if 
i = j and if i 7^ j. 
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The proof is given in Appendix [B] since it uses some elements of differential geometry not 
essential in the core of this paper. The interested reader will find also there a slightly lower bound 
than the bound presented in ([5]), exploiting covariant derivatives, Laplace-Beltrami operator and 
scalar curvature of M [28] . We can state now the following corollary of Proposition [TJ 

Corollary 1. In the conditions of Proposition^ if V has a non-zero condition number Kr 1 , 
then 

\S u (Xa) - S u (X b )\ < ||u|| 2 ^(A Q ,A b ) 2 (l + /C). (9) 

Therefore, in the dMP(a) decomposition of / based on T>&, even if at each iteration the exact 
position of the continuous optimal atom of T> is not known, we are now able to estimate the 
convergence rate of this MP provided we introduce a new quantity characterizing the set A^. 

Definition 3. The density radius p^ of a countable parameter space Ad C A is the value 

p d = sup inf dg(X, k). (10) 
AeA fe eA d 

We say that Ad covers A with a radius pd- 

This radius characterizes the density of Ad inside A. Given any A in A, one is guaranteed 
that there exists an element k of Ad close to A, i.e. within a geodesic distance pd- 

Theorem 1. Given a Hilbert space S C L 2 (X) with a non zero greedy factor (5, and a dictionary 
T> = dict(A) C S of non-zero condition number K. , if Ad covers A with radius pd, and if 
Pd < /3/Vl + then, for functions belonging to S, a dMP(a) algorithm using = dict(Ad) 
is bounded by the exponential convergence rate of a cMP(a') using V with a weakness parameter 
given by a' = a{\ - (5~ 2 p 2 d (l + K,)) 1/2 < a. 

Proof. Notice first that since / £ S and 2?d C T> C S, R m f £ S for all iteration m of dMP. 
Let us take the (m + l) th step of dMP(a) and write u = R m f. We have of course ||i? m+1 /|| 2 = 
IMP — S u {k m+ i), where k m +\ is the atom obtained from the selection rule (j3cj) . i.e. S u (k m+ i) > 

" 2 su PfceA d Su (AO- 
Denote by g~ x the atom of V that best represents R m f, i.e. S U (X) = sup AeA S u (A) . If k is 
the closest element of A in Ad, we have dg(X, k) < pd from the covering property of Ad, and the 
Proposition □ tells us that, with u = R m f, \S u (k) - S U (X)\ < p\ (1 + K) ||n|| 2 , since diS u (X) = 
for all i. 

Therefore, S u {k) > S U (X) - p\ (1 + K) \\u\\ 2 > (5 2 \\u\\ 2 - p 2 (1 + K) \\u\\ 2 , and S u (k) > (3 2 (l - 
(3~ 2 pj(l + /C)) ||i? m /|| 2 , this last quantity being positive from the density requirement, i.e. p& < 

p/VTTTc. 

In consequence, 5„(A; m+ i) > a 2 sup fcgAd S u (k) > a 2 S u (k), implying ||i?' m+1 /|| 2 = \\u\\ 2 - 
S u (k m+1 ) < \\u\\ 2 ■ a 2 S u (k) < \\u\\ 2 (l - a' 2 (3 2 ), for a' = a(l - f3~ 2 p 2 d (l +/C)) 1/2 . So, 
||^m+ij|| < (i _ a' 2 j3 2 )( m+l ^ 2 1|/||, which is the exponential convergence rate of the Weak(a) 
Matching Pursuit in T> when (3 exists [29,30]. □ 

The previous proposition has an interesting interpretation : a weak Matching Pursuit de- 
composition in a discrete dictionary corresponds, in terms of rate of convergence, to a weaker 
Matching Pursuit in the continuous dictionary from which the discrete one is extracted. 
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About the hypotheses of the proposition, notice first that the existence of a greedy factor 
inside S concerns the continuous dictionary T> and not the discrete one T>&. Consequently, 
this condition is certainly easier to fulfill from the high redundancy of T>. Second, the density 
requirement, pd < + K,, is just sufficient since the Proposition Q] does not state that it 

achieves the best bound for the control of |5 U (A Q ) — S u (Xb)\ when A a is critical. It is interesting 
to note that this inequality relates pd, a quantity that characterizes the discretization Aj in A, to 
and /C, which depend only on the dictionary. In particular, (3 represents the density of T> inside 
S C L 2 (X), and K, depends on the shape of the atoms through the curvature of the dictionary. 

Finally note that as (3 < 1 (from definition (j4|)) and JC > 1 (Prop. [2]), the density radius must 
at least satisfy pd < -j= to guarantee that our analysis is valid. 

4.2 A Simple Example of Discretization 

Let us work on the line with L 2 (X) = L 2 (R, dt), and check if the hypothesis of the previous 
theorem can be assessed in the simple case of an affine (wavelet-like) dictionary. 

We select a symmetric and real mother function g £ L 2 (R) well localized around the origin, e.g. 
a Gaussian or a Mexican Hat, normalized such that \\g\\ = 1. The parameter set A is related to the 
affine group, the group of translations and dilations G a ff . We identify A = (A = 6, A 1 = a), where 
5sl and a > are the translation and the dilation parameters respectively. The dictionary T> is 
defined from the atoms g\(t) = [U(X)g](t) = aT l/2 g((t-b)/a), with \\g x \\ = 1 for all A £ A. Our 
atoms are nothing but the wavelets of a Continuous Wavelet Transform if g is admissible [35], 
and U is actually the representation of the affine group on I? (R) [36] . 

In the technical report [31], we prove that the associated metric is given by Gij(X) = a~ 2 W, 
where W is a constant 2x2 diagonal matrix depending only of the mother function g and its first 
and second derivatives. Since Q^(X) = a 2 W~ l , K, can be bounded by a constant also associated 
to g and its first and second order time derivatives. 

Finally, given the r-adic parameter discretization 

Ad = {kjn = (bj n ,aj) = (nb T j ,a T j ) : j,n£Z}, 

with t > 1 and ao, fro > 0, the density radius pd of Ad is shown to be bounded by pd < 
C(Zq 1 &o + Dim, with C and D depending only of the norms of g and its first derivative. 

This bound has two interesting properties. First, as for the grid Ad, it is invariant under the 
change (bo, ao) — > (2&0j 2ao). Second, it is multiplied by 2 n if we realize a "zoom" of factor 2 n in 
our r-adic grid, in other words, if (bo,T) — ► (2 n bo, r 2 ™). By the same argument, the true density 
radius has also to respect these rules. Therefore, we conjecture that pd = C'a^bo + D' lnr, for 
two particular (non computed) positive constants C and D' . 

Unfortunately, even for this simple affine dictionary, the existence of (3 = (3(S, T>) is non trivial 
to prove. However, if the greedy factor exists, the control of r, ao and bo over pd tells us that it 
is possible to satisfy the density requirement for convenient values of these parameters. 

5 Optimization of Discrete Matching Pursuits 

The previous section has shown that under a few assumptions a dMP is equivalent, in terms of 
rate of convergence, to a weaker cMP in the continuous dictionary from which the discrete one 
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has been sampled. 

Question 2. Can we improve the rate of convergence of a dMP, not with an obvious increasing 
of the dictionary sampling, but by taking advantage of the dictionary geometry ? 

Our approach is to introduce an optimization of the discrete dMP scheme. In short, at each 
iteration, we propose to use the atoms of T>^ as the seeds of an iterative optimization, such as 
the basic gradient descent/ascent, respecting the geometry of the manifold A4 = (A,Qij). 

Under the same density hypothesis of Theorem [TJ we show that in the worst case and if the 
number of optimization steps is large enough, an optimized discrete MP is again equivalent to a 
continuous dMP, but with a weakness factor two times closer to unity than for the non-optimized 
discrete MP. 

In this section, we first introduce the basic gradient descent/ascent on a manifold. Next, we 
show how this optimization can be introduced in the Matching Pursuit scheme to defined the 
geometrically optimized MP (gMP). Finally, the rate of convergence of this method is analyzed. 

5.1 Gradient Ascent on Riemannian Manifolds 

Given a function u 6 L 2 {X) and S U (X) = (g\,u) 2 , we wish to find the parameter that maximizes 
S u , i.e. 

A* = argmax S U (X) (P-l) 
AeA 

Equivalently, by introducing h u \ = (g\, u) g\, we can decide to find A* by the minimization 

A* = argmin \\u — h Ut \\\ 2 . (P-2) 
AeA 

If we are not afraid to get stuck on local maxima (P.l) or minima (P.2) of these two non-convex 
problems, we can solve them by using well known optimization techniques such as gradient 
descent /ascent, or Newton or Newton-Gauss optimizations. 

We present here a basic gradient ascent of the Problem (P.l) that respect the geometry of 
M. = (A,Qij) [37]. This method increases iteratively the value of S u by following a path in A, 
composed of geodesic segments, driven by the gradient of S u . 

Given a sequence of step size t r > 0, the gradient ascent of S u starting from Ao € A is defined 
by the following induction [38] : 

^o(Ao) = A, <j) r+1 (\ ) = j(t r , r (A O ), Cr(Ao)), 

where £>(Ao) = | V5 u (^ r (Ao))| _1 VS , n (0 r (Ao)) is the gradient direction obtained from the gradient 
V l S u = Q %3 djS u , and j(s, Ao, Co) is the geodesic starting at Ao = 7(0, Ao, Co) with the unit velocity 
Co = ^7(0, Ao,Co)- Notice that V* is the natural notion of gradient on a Riemannian manifold. 
Indeed, as for the Euclidean case, with V*/i = djh for h E L 2 (X), given w G T\A, the 
directional derivative D w h is equivalent to D w h{\) = w % dih{\) = (Vh,w)\ = w l V 3 h{\) ^ (A), 
since Q ik Q k j = <5}- 

Practically, in our gradient ascent, we use the linear first order approximation of 7, i.e. 

</y+i(A) = </y(A) + *rfr(A), (11) 
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valid for small value of t r (error in 0(t 2 )). This is actually an optimization method since 
diS u (M^)e r = \dS u (M^))\ > and S u (cf> r+l (X)) = S u ((f) r (\)) + t r \dS u ((f) r (\))\ + 0(t 2 r ) > 
Sui^rW), f° r a convenient step size t r > 0. At each step of this gradient ascent, the value 
t r is chosen so that S u is increased. This can be done for instance by a line search algo- 
rithm [39]. From the positive definiteness of Qy and Q % i , a fixed point <^> r+1 (A) = 4> r {\) is 
reached if V i 5 u (^ r (A)) = diS u {^ r {\)) = for all i. 

More sophisticated algorithms such as Newton or Newton-Gauss can be developed to solve 
the Problem (P.2) on a Riemannian manifolds [38, 40] even if, unlike to the flat case, a direct 
definition of the Hessian does not exist on differentiable manifolds. However, we will not use them 
here as our aim is to prove that a dMP driven by the very basic optimization above provides 
already a better rate of convergence than the non-optimized dMP. 

5.2 Optimized Discrete Matching Pursuit Algorithm 

Let us optimize each step of a discrete MP using the gradient ascent of the previous section. 

Definition Given sequence of positive integers K m and a weakness factor < a < 1, the 
geometrically optimized discrete matching pursuit (gMP(a)) is defined by 

R°f = f (initialization), (12a) 

R m+l f = R rn f _ (g Um+l1 Rmf)g Um+ll ( 12 b) 

(9u m+1 ,R m f) 2 > a 2 sup ( 9(PKm{k) ,R m f) 2 . (12c) 
fceAd 

Notice that the best atom g Um+1 is selected in the set <J> m = {g ( p Km (k) '■ k £ Ad} C V. Elements 
of $ m are determined by applying the optimization function <p r : Aj — > A of our gradient ascent 
defined in (jlip on elements of A^. In consequence, $ m depends on R m f and is thus different at 
each iteration m. 

Rate of convergence The following theorem characterizes the rate of convergence of the 
optimized Matching Pursuit defined in (|12p . 

Theorem 2. Given the notations and the conditions of Theorem d there exists a sequence of 
positive integers K m such that, the gMP(a) decomposition of functions in S C L 2 {X) optimized 
K m steps at each iteration m, is bounded by the same rate of convergence as a cMP(a") using the 
corresponding continuous dictionary T> with a" = q(1 — ^ j3~ 2 p& (1 + /C)) 1 / 2 < a. 

In other words, for a = 1, a gMP is equivalent to a cMP with a weakness factor two times 
closer to unity than the one reached by a dMP in the same conditions. Before proving this result, 
let us introduce some new lemmata. 

Lemma 1. Given afunctionu E L 2 {X) and a dictionary!) of non-zero condition number JC~ l , if 
X a is critical in the direction of Xb, and if Xb is critical in the direction of X a , i.e. 7'*(0) diS u (X a ) = 
r y n {d) diS u (Xb) = for 7 = 7 AaAfc the geodesic joining X a and Xb and d = dg(X a , Xb), then 

\S u (X a ) - S u (X b )\ < l\\u\\ 2 dg(X a ,X b ) 2 (l + lC). (13) 
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Proof. Without loss of generality, assume that S u (X a ) > S u (X b ). If this is not the case, we can 
switch the labels a and b. Let us define X(9) = "f(9d) with 9 G [0, 1] on the geodesic 7 = 7 A(jA ■ 
We have A a = A(0) and X b = A(l). Using the Corollary [TJ the two following inequalities hold : 
S U (X(9)) > S u (X a )-\\u\\ 2 d g (X(9), X a ) 2 (1+/C) and S U (X(9)) < S u (X b ) + \\u\\ 2 dg(X(9), X b ) 2 (1 + /C). 

Therefore, since by definition of X(9), dg(X(9), X a ) = 9d and dg(X(9), X b ) = (1 — 9)d, we find 
Su(Xa) - S u {X b ) < \\u\\ 2 (9 2 + (9- l) 2 ) d 2 (1 + K) for all 9 G [0, 1]. Taking the minimum over all 
9, we obtain finally S u {X a ) - S u (X b ) < \ \\uf dg{X a , X b ) 2 (1 + K). □ 

In other words, the critical nature of A a and X b divides by two the bound on the decreasing 
of S u between them compared to the situation where only one of these points is critical. 

Lemma 2. Given a function u G L 2 (X), assume that S U (X) has a global maximum at Xm, i-e. 
diS u (XM) = for all i, and write Tk = {4> r (k) : r G N} the trajectory of the gradient ascent 
described in (fTT]) starting from a point I: £ Aj. There exists a X' G that can be reached in a 
finite number of optimization steps, such that 

Su(Xm) - Su(X') < ±|M| 2 dg{X M ,kf{l+K). (14) 

For the sake of clarity, the proof of this technical Lemma is placed in Appendix [Cj The main 
idea is to find a point in the trajectory 7^ that is closer to Xm than k, and that is also critical 
in the direction of Am so that Lemma Q] can be applied. Let us now enter in the proof of the 
previous proposition. 

Proof of Theorem^ In our gMP(a) decomposition of a function / G S C L 2 (X) defined before, 
given the iteration m + 1 where u = R m f is analyzed, denote by A the parameter of the atom in 
V maximizing S u , i.e. S U (X) = sup AgA S U (X). 

If k is the closest element of to A, from the covering property of Ad we have dg(X, k) < p,i, 
and the Lemma [2] tells us that there exists a finite number of optimization steps K m such that 
Su(4>K m (k)) > S U (X) -\pd \\uf (1 + K) > P 2 (1 - jj/T 2 pi (1 + £)) IMI 2 , where the last term is 
positive from the density requirement pd < (3/\/l + !C. 

Therefore, from the selection rule (|12c|) . S u (v m+ i) > a 2 S u ((f) Km (k)). We have thus ||i? m+1 /|| 2 = 
\\u\\ 2 -S u (u m+1 )< ||n|| 2 -a 2 540 Km (fc)) < \\u\\ 2 (l-a" 2 (3 2 ), with a" ± a (l-lf3- 2 p 2 (1 + /C)) 1/2 . 
So, ||i? m+1 /|| < (1 — a" 2 (3 2 )^ m+l ^ 2 1|/|| which is also the exponential convergence rate of the 
cMP(a") in V when /? exists. □ 

In Theorem [21 even if the sequence of optimization steps K m is proved to exist, it is actually 
unknown. One practical way to overcome this problem is to observe how the ratio J^hI decreases 
at each optimization steps, and to stop the procedure once this value falls below a predefined 
threshold. This follows from the idea that the closer to a local maximum S u (4> r (k)) is, the smaller 
must be the optimization step. As it is often the case in optimization problems, an upper bound 
on the number of optimization steps can be fixed jointly to this threshold test. 

6 Experiments 

In this section, dMP and gMP decompositions of 1-D and 2-D signals are studied experimentally 
in different situations. These will imply different classes of signals and different discretization of 
parametrization of various densities. 
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Prior to these experiments, some remarks have to be made about dMP and gMP implemen- 
tations. First, for both algorithms, as described in Equations ([3j) and (|12p . a full-search has to 
be performed in = dict(Ad) to compute all the squared scalar products S u of the current 
residue u = R m f, with atoms gt- We decide thus to reduce the computational complexity of 
this full-search with the help of the Fast Fourier Transform (FFT). One component (for 1-D sig- 
nals) or two components (for 2-D signals) of the parametrization correspond indeed to a regular 
grid of atoms positions, which makes S u a discrete correlation relatively to these parameters. 
Moreover, as described in detail in [41,42], we apply the fast boundary renormalization of atoms, 
where atoms of T> truncated by the limit of the signal remain valid atoms, i.e. of unit norm, and 
features that suddenly terminate at the signal boundary are correctly caught in the procedure. 
Notice that all our dMP and gMP experiments are performed in the non-weak case, i.e. a = 1. 

Second, for the Gradient-Ascent optimization, we realize some simplifications to the initial 
formulation : the best discrete atom only is optimized at each MP iteration and K m = k > for 
all m, with k typically equal to 5 or 10. Even if these two restrictions are not optimal compared 
to the method described in the theoretical results, the gain of the optimization in the quality of 
signals reconstructions is already impressive. We also set all the step sizes to t r = x > 0, with 
X = 0.1 in all our experiments. Then, at each optimization step r, we adaptively decrease the step 
parameter t r by dividing it by 2 if the ascent condition is not met, i.e. if S u {4> r+ \{k)) < S u ((j) r (k)). 
If after 10 divisions, the ascent condition still does not hold, the optimization process is simply 
stopped. 

Finally, let us mention that our algorithms are written in MATLAB© and are consequently 
not truly optimized. The different computation times that we provide through this section allow 
us only to compare various schemes, as for dMP and gMP decomposition of the same signal. All 
of our experiments were realized on a Pentium 1.73 GHz with 1Gb of memory. 



6.1 One Dimensional Analysis 

This section analyzes the benefit obtained from gMP, and from an increased density of the discrete 
dictionary, when decomposing some specific classes of randomly generated 1-D signals. In our 
experiments, each 1-D signal is of unit norm and has N = 2 13 samples. Each signal consists of 
the sum of 100 random bursts, each burst being a rectangular or Gaussian window, depending on 
the class of the signal. The position and magnitude of each burst is selected at random, according 
to a uniform distribution. The duration of the rectangular window and the standard deviation 
of the Gaussian function are selected uniformly within the range [\L, \L], for L = 2 8 . The 
mother function of the dictionary is the Mexican Hat function g(t) tx (1 — t 2 ) e - * 2 / 2 . Its scale and 
translation parameters are sampled as defined in Section [4.2[ following the r-adic discretization 
Ad = {(n&o*7~ J , clqt 3 ) : j,n G Z}, with ao = 1. We work in the non-weak case, i.e. a = 1, for dMP 
and gMP, and we set k = 10 for gMP. 



Figures 1(a) and 1(b) analyze how the energy ||i? m /|| 2 of the residual decreases with the 
number m of MP iterations for the random Gaussian and rectangular signals, respectively. Notice 
that only a small number of iterations are studied (twelve) since our analysis aims at analyzing 
the behaviour of dMP and gMP on one class of signals. However the current residual R m f 
belongs only approximately to the considered class on small m when not many atoms have been 
substracted to / = R°f. Results presented are averaged over 20 trials. In each graph, two 
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distinct discretizations of the Mexican Hat parameters are considered to provide two discrete 
dictionaries, with one (60 = l,log 2 T = 0.25) being two times denser than the other (60 = 
2,log 2 r = 0.5), according to the behavioJ^l of the density radius p& analyzed in Section |4T2"1 
Both discrete and geometrically optimized MP are studied for each dictionary. We observe that 
gMP significantly outperforms dMP, and that an increased density of the dictionary also speeds 
up the MP convergence. By comparing Figure 1(a) and 1(b), we also observe that the residual 
energy decreases much faster for Gaussian signals than for rectangular ones, which unsurprisingly 
reveals that the Mexican Hat dictionary is better suited to represent Gaussian structures. 

Figures l(c)|l(f) further analyze the impact of the discretization of the dictionary parameters 
on MP convergence. In these figures, we introduce the notion of normalized atom energy (NAE) 
to measure the convergence rate of a particular dictionary dealing with a specific class of signals 
at a specific MP iteration step. Formally, the NAE denotes the expected value of the best squared 
atom coefficient computed on a normalized signal when this one is randomly generated within a 
specific class of signals. Mathematically, NAE = E[(gA,j ]jl7j|) 2 ] ' w ^ ere u ls a sample signal of the 
class and the g\ t the associated best atom for a fixed greedy algorithm (dMP or gMP). We show 
the dependence of NAE on the discretization for the 1 st and 30*^ iteration^! for both rectangular 
and Gaussian signals. Results are averaged over 500 trials. 

By considering the dMP and gMP curves in Figures l(c)||l(f )" we first observe that the NAE 
is significantly higher for gMP than for dMP, which confirms the advantage of using gradient 
ascent optimization to refine the parameters of the atoms extracted by dMP. Note that the NAE 
for a Gaussian random signal (Fig. l(c)||l(d) ) is nearly one order of magnitude higher than for 
a rectangular one (Fig 1(e) 1(f) ). This confirms that the Mexican Hat dictionary better matches 



the Gaussian structures than the rectangular ones. In all cases, the NAE sharply decreases with 
the iteration index, which is not a surprise as the coherence between the signal and the dictionary 
decreases as MP expansion progresses. 

To better understand the penalty induced by the discretization of the continuous dictionary, 
we now analyze how the rate of convergence for a particular class of signals behaves compared to 
the reference provided by a signal composed of a single Mexican Hat function. For that purpose, 
an additional curve, denoted dMP a , has been plotted in each graph. This curve is expected to 
provide an upper bound to the penalty induced by a sparser dictionary. Specifically, dMP a plots 
the energy captured during the I s * step of the dMP expansion of a random (scale and position) 
Mexican Hat function, as a function of the discretization parameter log 2 r. As the Mexican Hat 
is the generative function of the dictionary, the 1 st step of the MP expansion would capture the 
entire function energy if the entire continuous dictionary were used, but is particularly penalized 
by a discretization of the dictionary. In each graph of Figures l(c)|l(f) , to compare dMP a with 
dMP, the dMP a curve obtained with pure atoms (i.e. unit coefficients) is scaled to correspond to 
atoms whose energy is set to the NAE expected from the expansion of the corresponding class of 
signals with a continuous dictionary. In practice, the NAE expected with a continuous dictionary 
is estimated based on the NAE computed with gMP and the densest dictionary (log 2 r = 0.25). 



11 Obviously equivalent for log 2 r or lnr variations. 

12 Note that the NAE at the 30 th iteration refers to the NAE computed on the residual signals obtained after 
29 iterations of the gMP with the densest dictionary, independently of the actual discrete dictionary considered at 
iteration 30. Hence, the reference class of signals to compute the NAE at iteration 30 is the same for all investigated 
dictionaries, i.e. for all log 2 r values. 
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Figure 1: (a)-(b) Residual energy as a function of the MP iteration. dMP (bo, log 2 t) and 
gMP (bo, log 2 t) refer to discrete and optimized MP, computed on a discretization Aj = 
{(nboT 3 , aoT- 7 ) : j, n € Z} of the continuous Mexican Hat dictionary, (c)-(f) Normalized atom 
energy (NAE) as a function of the log 2 r discretization parameter, bo is set to one in all cases. 
dMP and gMP respectively refer to discrete and optimized MP. dMP a provides a lower bound to 
the decrease of NAE with log 2 r, and is formally described in the text. 
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The approximation is reasonable as we observe that gMP saturates for small log 2 r values, i.e. 
for large densities. We first observe that both the dMP and the dMP a curves nearly coincide 
in Figure 1(c) Hence, the MP expansion of a Gaussian signal is penalized as much as the one 



of a Mexican Hat function by a reduction of the discrete dictionary density. We then observe 
that the penalty induced by a reduction of density decreases as the coherence between signal 
and dictionary structures drops. This is for example the case when the signal to represent 

or because the coherent 



is intrinsically sharper than the dictionary structures (Fig. l(e, 



1(0), 



structures have been extracted during the initial MP steps (Fig. 1(d)). This last observation is 
of practical importance because it reveals that using a coarsely discretized dictionary incurs a 
greater penalty during the first few iterations of the MP expansion than during the subsequent 
ones. For compression applications, it might thus be advantageous to progressively decrease the 
density of the dictionary along the expansion process, the cost associated to the definition of the 
atom indices decreasing with the density of the dictionar}0. Hence, it might be more efficient - 
in a rate-distortion sense - to use a dense but expensive dictionary during the first MP iterations, 
so as to avoid penalizing the MP convergence rate, but a sparser and cheaper during subsequent 
steps, so as to save bits. We plan to investigate this question in details in a future publication. 



6.2 Two Dimensional Analysis 

This section analyzes experimentally the effect of discretizing a dictionary on the Matching Pur- 
suit decomposition of images, i.e. with the Hilbert space L 2 (M 2 ). 



Parametrization and Dictionary We use the same dictionary as in [41]. Its mother function 
g is defined by a separable product of two 1-D behaviors : a Mexican Hat wavelet in the in- 
direction, and a Gaussian in the y-direction, i.e. g(x) = (^r) 1 / 2 (1 — x 2 ) exp(— \ |x| 2 ), where 
x = (x,y) G ]R 2 and ||<?|| = 1 [30]. Notice that g is infinitely differentiable. 

The dictionary is defined by the translations, rotations, and anisotropic dilations of g. Mathe- 
matically, these transformations are represented by operators 7b, Rg, and D a , respectively. These 
are given by [T b #](x) = #(x - b), [Rg g]{x) = ^(r^x), and [D a #](x) = (aia 2 )~ 1/2 g(d a x x) , 
for 9 G S 1 ~ [0, 27i"), b G M 2 , a = (a±, a 2 ), ai, a 2 G K_, while rg is the usual 2x2 rotation matrix 
r e and d a = diag(ai,a 2 ). 

In other words, we have a parametrization of P = 5 dimensions and A = {A = (A , . . . , A 4 ) = 
(&i, b 2 , 6, oi, a 2 ) G M? x S 1 x (]R+) 2 }. At the end, each atom of the dictionary V = {g\ : A G A} 
is generated by g A (x) = [Z7(A) </](x) = [T b Rg D a g] (x) , with \\g x \\ = \\g\\ = 1. 

Obviously, the dictionary T> is complete in L 2 (M?). Indeed, translations, rotations and 
isotropic dilations alone are already enough to constitute a wavelet basis of L 2 (X) since g is 
an admissible wavelet [35,43]. Finally, as requested in the previous section, from the smoothness 
of g and of the transformations U above, the atoms g\ of our dictionary T> are twice differentiable 
on each component A*. 



Spatial Sampling For all our experiments, images are discretized on a Cartesian regular grid 
of pixels, i.e. an image / takes its values on the grid X = ([0, N x ) x [0, N y )) nZ 2 , with N x and N y 
the "x" and "y" sizes of the grid. We work in the continuous approximation, that is we assume 

13 Less distinct atom indices need to be described by the codewords. 
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(a) (b) 



Figure 2: 300 atoms reconstruction results, (a) dMP : J = 5, K = 8, PSNR: 26.63 dB, 4634s. 
(b) gMP : J = 3, K = 4, k = 10, PSNR: 26.68 dB, 949s. 





J = 3 


J = 5 


K = 4 

(k = 5) 
(k = 10) 


24.30 dB (834s) 


25.88 dB (2327s) 


26.08 dB (889s) 
26.68 dB (950s) 


27.09 dB (2381s) 
27.37 dB (2447s) 


K = 8 

(« = 5) 
(k = 10) 


25.21 dB (1660s) 


26.63 dB (4634s) 


27.05 dB (1715s) 
27.44 dB (1772s) 


27.92 dB (4703s) 
28.09 dB (5131s) 



Table 1: dMP and gMP applied on Barbara image. Quality (in PSNR) of the reconstruction 
after 300 iterations for various J, K and k. In each table cell, the first row correspond to dMP 
result, the second and the third rows to gMP. 



that the grid X is fine enough to guarantee that the scalar products (•, •) and norms || • || are well 
estimated from their discrete counterparts. This holds of course for band- limited functions on 
L(R 2 ). 

In consequence, in order to respect this continuous approximations and to have dictionary 
atoms smaller than the image size, the mother function g of our dictionary D must be dilated in a 
particular range of scales so that g\ is essentially band-limited, i.e. ai,a 2 E [a m , o-m]- According 
to the definition of g above, we set experimentally a m = 0.7 and «m = min(A^ x , N y ). 

Discrete Parameter Space We decide to sample regularly A so that to have iVpi x = N x N y 
positions b, J 2 scales a\ and a 2 selected logarithmically in the range [a m , om], and K orientations 
evenly spaced in [0, ir) , with J, K G N. At the end, we obtain the discretized parameter set 
A d = A d (iV pix , J,K) = {(b,9 n , aij ,a 2j i), b e X, n £ [0, K - 1], j,f £ [0, J - 1]}, and the 
corresponding dictionary T>d(N p [ x , J, K) = dict(A ( j(iVpi x , J, K)). The number of atoms in the 
dictionary is simply |2? d | = J 2 K iV" p } x . 



17 



Image name 


dMP 


gMP (k = 10) 


Barbara 
Lena 
Baboon 
Cameraman 
GoldHill 
Peppers 


25.94 dB (2707s) 

26.50 dB (2709s) 
24.06 dB (2770s) 
25.80 dB (2807s) 
26.54 dB (2810s) 

24.51 dB (2853s) 


27.86 dB (2820s) 
28.53 dB (2857s) 
24.93 dB (2900s) 
27.62 dB (2918s) 
28.12 dB (2961s) 
26.69 dB (3013s) 



Table 2: Comparison of dMP and gMP on different usual images of size 128x128. Computations 
have been performed for J = 4, K = 8, 300 atoms. Computation times are given indicatively in 
parenthesis. 

Results We start our experiment by decomposing the venerable image of Barbara. 300 atoms 
were selected by dMP and gMP for various J and K. Results are presented in Table HJ In these 
tests, the best quality obtained for dMP corresponds obviously to the finest grid, i.e. J = 5 and 
K = 8 (26.63 dB, Fig |2(a)[ ), with a computational time (CT) of 4634s. With 10 optimization 
steps (k = 10), the gMP for the coarsest parametrization ( J = 3 and K = 4) is equivalent to the 
best dMP result with a PSNR of 26.68 dB and a CT of only 950s, i.e. almost five time faster. 
This is also far better than the dMP on the same grid (24.30 dB). The visual inspection of the 
dMP image (J = 5, K = 8, Fig ]2(a)D and the gMP image ( J = 3, K = 4, k = 10, |2(b)[ ) is also 
instructive. Most of the features of the gMP results are well represented (e.g. Barbara's mouth, 
eyes, nose, hair, ...). However, the regular pattern of the chair in the background of the picture, 
which needs a lot of similar atoms, is poorly drawn. This can be explained by the fact that this 
highly directional structure has to be represented by a lot of similarly oriented and scaled atoms 
with similar amplitude. The fine grid of dMP has therefore more chance to correctly fit these 
atoms, while the gMP on its coarse grid is deviated in its optimization process to more prominent 
structure with higher amplitudes. Notice finally, the best optimized result (PSNR 28.09 dB) is 
obtained for k = 10 on the grid associated to J = 5 and K = 8 orientations. 

For our second experiment, we compare dMP and gMP (k = 10) 300 atoms approximation of 
well known 128x128 pixels pictures, namely Lena, Baboon, Cameraman, GoldHill, and Peppers, 
on the same parametrization grid (J = 4, K = 8). For a computational time slightly higher (5%) 
than the dMP decomposition, we reach in all cases a significantly higher PSNR with gMP than 
with dMP, i.e. the dB gain is within the range [0.87,2.03]. 



7 Related Works 

A similar approach to our geometric analysis of MP atom selection rule has been proposed 
in [24]. In that paper, a dictionary of (L 2 -normalized) wavelets is seen as a manifold associate 
to a Riemannian metric. However, the authors restrict their work to wavelet parametrization 
inherited from Lie group (such as the affine group). They also work only on the L 2 (dictionary) 
distance between dictionary atoms and do not introduce intrinsic geodesic distance. They define 
a discretization of the parametrization A such that, in our notations, QijAX l AX :) < e, with AX(k) 
the local width of the cell localized on k 6 A<j. There is however no analysis of the effect of this 
discretization on the MP rate of convergence. 
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In [14], the author uses a 4-dimensional Gaussian chirp dictionary to analyze 1-D signals 
with MP algorithm. He develops a fast procedure to find the best atom of this dictionary in the 
representation of the current MP residual by applying a two-step search. First, by setting the 
chirp rate parameter to zero, the best common Gabor atom is found with full search procedure 
taking advantage of the FFT algorithm. Next, a ridge theorem proves that starting from this 
Gabor atom, the best Gaussian chirp atom can be approximated with a controlled error. The 
whole method is similar to the development of our optimized matching pursuit since we start 
also from a discrete parametrization to find a better atom in the continuous one. However, our 
approach is more general since we are not restricted to a specific dictionary. We use the intrinsic 
geometry of any smooth dictionary manifold to perform a optimization driven by a geometric 
gradient ascent. 

8 Conclusions 

In this paper, we have adopted a geometrical framework to study the effect of dictionary dis- 
cretization on the rate of convergence associated to MP. In a first step, we have derived an upper 
bound for this rate using geometrical quantities inherited from the dictionary seen as a manifold, 
such as the geodesic distance, the condition number of the dictionary, and the covering property 
of the discrete set of atoms in the continuous dictionary. We have also shown in a second step how 
a simple optimization of the parameters selected by the discrete dictionary, can lead theoretically 
and experimentally to important gain in the approximation of (general) signals. 

In a future study, it could be interesting to see how our methods extend to other greedy 
algorithms, like the Orthogonal Matching Pursuit (OMP) [44]. However, this extension has to 
be performed carefully since we need to characterized the convergence of continuous OMP, as it 
is here for the one of MP induced by the existence of a greedy factor. 

Our work paves the way for future extensions and advances in two practical fields. As ex- 
plained in our 1-D experiments, a first idea could be to analyze carefully the benefit - in a 
rate-distortion sense - of using a dense but expensive dictionary during the first MP iterations, 
so as to avoid penalizing the MP convergence rate, but a sparser and cheaper dictionary during 
subsequent steps, so as to save bits. We plan to investigate this question in details in a future 
publication. 

Another idea is to analyze the behaviors of gMP in the Compressive Sensing (CS) formalism, 
that is after random projection of the signal and atoms. Matching Pursuit is already used 
currently as a retrieval algorithm of CS of sparse signals [45-47]. However, recent results [48] 
suggests also that for manifold of bounded condition number, their geometrical structure (metric, 
distances) is essentially preserved after random projection of their points in a smaller space than 
the ambient one. If a natural definition of random projection in our continuous formalism can 
be formulated, a natural question is thus to check if the gradient ascent technique survives after 
random projection of the residual and the atoms on the same subspace. This could lead to 
dramatic computation time reduction, up to controlled errors that could be even attenuated by 
the greedy iterative procedure. 
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A Complements on the Geometry of (A, Q^) 

In this short appendix, we provide some additional information on the geometrical concepts 
developed in Section[2l First, as explained in that section, the parameter space A of the dictionary 
T> = diet (A) is linked to a Riemannian manifold Ai = (A, Qij) with a structure inherited from the 
dictionary V C L 2 (X). From the geodesic definition (pQ) and the metric relation ([2]), we see that 
the curve 7 AaAfc (^) £ A is thus also a geodesic in Ai. In other words, it is defined only from the 
metric Qij and not anymore from the full behavior of atoms of T> C L 2 (X). In [31], we explain 
also that Ai is in fact an immersed manifold [28] in the Hilbert manifold T> C L 2 (X), and Qij 
is the associated pullback metric. All the geometric quantities of the Riemannian analysis of Ai, 
such as Christoffel's symbols, covariant derivatives, curvature tensors, etc. can be defined. This 
is actually done in the following appendices of this paper. 

Second, some important designations can be introduced. The metric Qij(X) is a (covariant) 
tensor of rank-2, i.e. described by two subscript indices, on Ai. This means that Qij satisfies 
a specific transformation under changes of coordinates in T\A such that the values of the bi- 
linear fornix G\{£,0 — C & that it induces are unmodified 1 ^!. A function / : A — > R 
is a scalar field on A^, or rank-0 tensor. A vector field CW on this manifold, which asso- 
ciates to each point A a vector in the tangent plane T\A, is a function £ : A — ► T\A ~ R p 
also named (contravariant) rank-1 tensor, i.e. with one superscript. More generally, a rank- 
(m,n) tensor is a quantity T^ 1 .'.'. j™(A) m-times contravariant and n-times covariant such that 

Qnki ■ ■ ■ Qi m k m Ci 1 • • • £m m Tjl '." j™W • • • Qn is invariant under change of coordinates in T\A 
for any vectors {£i, • • • , £ m , Cl> • • • , Cn} in this space. 

B Proof of Proposition [2] 

Let 7 be a geodesic in Ai with curvilinear parametrization, i.e. with |7'(s)| = 1- Writing 7 = 7(5) 
and V = ^7(s), we have ^ g 7(s) = <9i# 7 7" and ^2 = d^g^ +d k g~, i' k \ where we write 
abusively dig 1 = digx\\ =1 ( s ) and similarly for second order derivative. 

We need now some elements of differential geometry. Since 7 is a geodesic in Ai, it respects 
the second order differential equation j" k + T^- r ) n ^i'^ = 0, where the values = ^ Q lk (dj Qu + 

14 Also named first fundamental form [28]. 

15 In the same way that the scalar product between two vectors in the usual Euclidean space is independent of 
the choice of coordinates. 
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di Qji — d[ Qijj are the Christoffel's symbols [28] derived from the metric Qij. Therefore, we get 



rcl^g, = % 577 'V i -^ 7 r^yV i (15) 
= V^W', (16) 

where Vj<? 7 = dig 1 and Vijg 1 = VjVj<7 7 = dij g y — d k g y T^- are by definition the first order i 
and the second order ij covariant derivatives of g 7 respectively [28]. In addition, we can easily 
compute that for M. = (A, Qij), 

T% = Q kl (dijg x ,d l9x ). (17) 

The lower bound of the proposition comes simply from the projection of 4-% <? 7 ( s ) onto g-y. 
Indeed, for any A G A, since ||#a|| 2 = (g\,9\) = 1, {dig\,g\) = and {d ijg \,g\) = -Qij. By 
(HSJ), (gpr 9 7 ( s ),9y) = {9ijg y , g y ) W J = -Qijl'W 3 = -1, and using Cauchy-Schwarz we get 
Hgjr 97(a) || — 1- Therefore, for e > and 7^ : [0, e] — ► A, a segment of geodesic starting from A 
with unit speed £, 

For the upper bound, coming back to any geodesic 7, we need to analyze directly ||^ 9-y(s) IP- 
Using (fT6|) and the expression (fT7|) of the Christoffel's symbols above, we have II 57(s) II 2 = 
W^ij S7 j' 1 ^ 3 \\ 2 < (V \j 57, V 'ki gj) Q tk Q jl , where we used |7'| = 1 and the Cauchy-Schwarz (CS) 
inequality expressed in the Einstein's summation notation on rank-2 tensors. This latter states 
that, for the tensors A {j = Vy # 7 and B ij = VY', \AijB^\ 2 < \A^ A kl Q ki Q l i \ \B l 'J B kl Q ki Qij\, 
the equality holding if the two tensors are multiple of each other. We prove in [31] the general 
explanation for rank-n tensor as a simple consequence of the positive-definiteness of Qij. 

Therefore, taking 7 = 75, and since 7^(0) = A, 



± 

K < sup \(Vij gx ,V k i gx) Q ik Q jl V ■ (18) 
AeA L J 



AeA 

In the companion Technical Paper [31], we prove that this inequality is also equivalent to 

K < sup [R(X) + \\Ag x \\ 2 ]^, 
AeA 

where R is the scalar curvature of M, i.e. the quantity R = RijkiQ lk Q jl contracted from the 
curvature tensor R iklm = \{d k iQ im + d im Q k i - d km Qu - duQ km ) + Qn P (^kl T im ~ r Li r ij)> and 
Ag x = QijX7 l V J g x is the Laplace-Beltrami operator applied on g x . The curvature R requires 
only the knowledge of Qij(X) (and its derivatives), implying just one step of scalar products 
computations, i.e. integrations in L 2 (X). 

The reader who does not want to deal with differential geometry can however get rid of 
the covariant derivatives of Equation (|18p by replacing them by usual derivatives. This provides 
however a weaker bound. Indeed, using the expression (|17p of the Christoffel's symbols, some easy 
calculation provides < (Vy g x , V w g X ) Q lk Q jl = {dij g x ,d kl g x ) Q lk Q' 1 - a ijk a lmn Q il Qi m Q kn , 
with <Hjk = {dijg x ,d k g x ). 
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Therefore, (y ij g\,V u 9\) < (<% 9\,dki g\) Q lk 3 , from the positive definiteness of ij and 
Qij. Indeed, if we write w l]klmn — g tl gi m g kn ; anc i if we gather indices ijk and Iran in the two 
multi-indiceJ^l I = (i,j,k) and L = (l,m,n), W IL can be seen as a 2-D matrix in M p3xp3 . It 
is then easy to check that the P 3 eigenvectors of W IL are given by the P 3 combinations of the 
product of three of the P eigenvectors of Q lJ , i.e. the covariant vectors Q respecting the equation 
ij Cj = fi 5 lJ Q for a certain u = u(C,) > 0. The matrix Q lJ being positive, the eigenvalues of W IL 
are thus all positive, and W is positive. Therefore, ajW IL aL > for any tensor aj = a^. ■ 

C Proof of Lemma [2] 

Recall that we use the gradient ascent defined from the optimization function (f> r such that 
4> r+ i(k) = 4> r {k) + t r ^ r {k), for a sequence of positive step size t r increasing S u at each step, 
and for a step direction £ r (A) — I ^ S u (4> r (X))\~ 1 V l S u ((j) r (X)). From this definition, starting from 
k £ A, if lim r _» +00 <j) r (k) = k°° G A exists, then k°° is a point where V l S u (k co ) = for all i, since 
S u (0 r+l (k)) = S u (Mk)) + U \dS u (Mk))\ + 0(t 2 ). 

How may the trajectory 7*. = {(j) r (k) : r 6 N} contain a point A' satisfying (|14p ? Let us write 
7 r (s) for the geodesic linking 4> r {k) to Am, and define the distance function Cr = dg(\M,4'r(k))- 
We have thus 7 r (0) = 4> r {k) and 7 r (Cr) = Am, where Am is the global maximum of S u . 

Case 1. If Col'a {fy Gij{k) < 0, i.e. the optimization starts in the wrong direction. The function 
ip(s) = 5 u (7o(s)) is twice differentiable over [0, Co] an d for s close to zero, we have ^(0) > ip(s) 
since <//(0) = 8^)^(0) = \VS u (k)\ ^ 7o J '(0) G tJ {k) < 0. 

Since Aa^ is a global maximum of S u , ip(0) < ip((o) = S u (Xm)- Therefore, there exists a 
s* G (0, Co) that minimizes tp, i.e. tp'(s*) = with ?p(s*) < ip(0). For A* = 7o(s*), this implies 
that A* is critical since ip'(s*) = diS u (X*)'y' l (s*) = 0. From Lemma [TJ S u (\m) — S^A*) < 
7}\\u\\ 2 dg(\M, K) 2 {1 + IC) < 7}\\u\\ 2 dg(\M, k) 2 (1 + /C), since c?^(Am, A*) <dg(Ao,A:). Finally, for 
any A' G T fc , S U (X M ) - S U (X') < S U (X M ) - S u (k), and S U (X M ) - S U (X') < S U (X M ) - S U (X*) < 
\\\u\\ 2 dg{X M ,k) 2 {l+K), since S u (k) > S(A*)- 

Case 2. If CqIq {^>)Gij{k) = 0. We have right away 7 ! (0)diS u (k) = 0, and A; is a critical point 
in the direction Am- Lemma [1] applied on k gives S u (Xm) — S u (k) < \ ||u|| 2 dg(X\i, k) 2 (1 + /C). 
Since S^Am) — S U (X') < S u (Xm) — S u (k) for any A' G 7*., Equation (Q3|) holds. 

Case 3. If '(Q)Gij(k) > 0. Let us analyze the behavior of the distance function Cr- 

Let us introduce the function g?m(A) = dg(X]\f,X). As for the Euclidean space, it is easy to 
provj^l that V*(iA/(A) = — 7*(0) if 7 is the geodesic linking A = 7(0) to Am- Therefore, since 
Cr+i = dM(4>r+i(k)), a Taylor expansion of g?m(A) around A = 4> r (k) provides 

Cr+l = Cr ~ trC(k)^(0)0 t] (Mk)) + 0(t 2 ). (19) 

For r = 0, if to is sufficiently small, Ci < Co an d Cr nas either a local minima on a particular step 
r m > 0, or it decreases monotically and converges to a value Coo = linv^oo Cr < Co- 

16 This can be seen as a relabelling of the P 3 combinations of values for ijk into P 3 different one-number indices 

I. 

17 The interested reader will find a proof of this basic differential geometry result in the companion Technical 
Report [31]. 
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(i) C has a local minima £ rm < Co on r m > : Then, Cr m +i > Cr m an d, using (fl~9j) with some 
implicit dependences, Cr m +i — Cr m = — t Tm 7^(0)£r m £ij + 0(t 2 m ). Therefore, for a sufficiently 
small step t Tm , 7^ (0) £r m &j < and we are in the same hypothesis as Case 1 with the point 
A' = (j>r m ( k ) G % instead of fc. We obtain then 5 u (Am) - S U (X') < ±\\u\\ 2 dg(X M , A') 2 (1 +/C) < 
i||w|| 2 (i g (A A ./,A;) 2 (1 +/C), since dg(X M ,X') = Cr m < Co = dg(X M ,k). 

(ii) If Cr decreases monotically for r > : Since Cr > 0, the limit lim^oo Cr exists and 
converges to Coo < Co- However, it is not guaranteed that the sequence {<p r (k)} converges to 
a point of A. Fortunately, since for all r > 0, 4> r {k) remains in the finite volume Vq = {A £ 
A : cZa/(A) < dM(fc)}, this sequence is bounded in the finite dimensional space A. Therefore, 
from the Bolzano- Weierstrass theorem on the metric space (A,dg(-, •)), we can find a convergent 
subsequence G N : rj+i > r,;} such that lim^oo 4> ri {k) = feoo £ Vb- On this point, we will have 
V l S u (k 00 ) = for all i. So, fcoo is an umbilical point and, from Lemma [U 

S U (X M ) - Suikoo) < l\\u\\ 2 dgiXn^koc) 2 (1 +/C). 

From now on, we abuse notation and write 4> n {k) = 4>i(k). Since C^ = dg(XM,k 00 ) 2 < 
dg(XM,k) 2 = Cq, we can find a 5 > such that dg{XM-,k oa ) 2 + 5 < dg(XM,k) 2 . Therefore, 
because lim^oo S u (4>i(k)) = S u (k°°) by continuity of S u , and since S u {4>i{k)) increases monot- 
ically with i, there exists a i' > such that 5„(/c 00 ) - S(<f>v(k)) < l\\u\\ 2 5 (1 + /C). With 
X' = <t> v {k) G 7;, we finally get 5 u (Am)-5 u (A') < 5 u (A Af ) - S,,^ 00 ) + ±\\u\\ 2 6 (1 + /C), so that 
S u (Xm) — S U (X') < i||n|| 2 (dg(Ajvf, /c°°) 2 + <5) (1 + /C) < i||w|| 2 dg(A M , ^) 2 (1 + /C). This gives the 
result and concludes the proof. ■ 
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